Experimental Comparison of Set Intersection Algorithms for Inverted Indexing
Abstract
The set intersection problem is one of the main problems in document retrieval. A query consists of two keywords, and for each keyword we have a sorted set of the IDs of the documents containing it. The goal is to retrieve the set of document IDs containing both keywords. We perform an experimental comparison of Galloping search and a new algorithm by Cohen and Porat (LATIN 2010), which has a better theoretical time complexity. We show that the new algorithm often performs worse than the trivial one on real data. We also propose a variant of the Cohen and Porat algorithm with a similar complexity but better empirical performance. Finally, we investigate the influence of document ordering on query time.
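To make the baseline concrete, here is a minimal sketch of intersecting two sorted posting lists with galloping (exponential) search. It follows a common textbook formulation and is not necessarily the implementation evaluated in the paper; it is also not the Cohen and Porat algorithm. The function name `gallop_intersect` and the sample lists are hypothetical, chosen for illustration only, and the lists are assumed to be sorted with no duplicate IDs.

```python
from bisect import bisect_left


def gallop_intersect(small, large):
    """Intersect two sorted lists of unique document IDs (illustrative sketch)."""
    # Iterate over the shorter list and search for each ID in the longer one.
    if len(small) > len(large):
        small, large = large, small

    result = []
    n = len(large)
    lo = 0  # invariant: every element of large[:lo] is smaller than the current ID

    for doc_id in small:
        # Gallop: double the step until we bracket doc_id or reach the end.
        step = 1
        while lo + step < n and large[lo + step] < doc_id:
            step *= 2
        hi = min(lo + step + 1, n)

        # Binary search inside the bracketed window large[lo:hi].
        lo = bisect_left(large, doc_id, lo, hi)
        if lo == n:
            break  # all remaining IDs in `small` exceed everything in `large`
        if large[lo] == doc_id:
            result.append(doc_id)

    return result


# Hypothetical posting lists for two keywords.
keyword_a = [2, 5, 9, 40]
keyword_b = [1, 2, 3, 5, 8, 13, 21, 34, 40, 55]
common = gallop_intersect(keyword_a, keyword_b)
```

Because the search position only moves forward, each lookup costs time logarithmic in the distance to the previous match, which is why this approach tends to do well when one list is much shorter than the other.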
Similar resources
Faster Exact Histogram Intersection on Large Data Collections Using Inverted VA-Files
Most indexing structures for high-dimensional vectors used in multimedia retrieval today rely on determining the importance of each vector component at indexing time in order to create the index. However, for Histogram Intersection and other important distance measures this is not possible because the importance of vector components depends on the query. We present an indexing structure inspired...
An Effective Approach to Temporally Anchored Information Retrieval
We consider in this paper the information retrieval problem over a collection of time-evolving documents such that the search has to be carried out based on a query text and a temporal specification. A solution to this problem is critical for a number of emerging large scale applications involving archived collections of web contents, social network interactions, blog traffic, and information f...
Fast Sorted-Set Intersection using SIMD Instructions
In this paper, we focus on sorted-set intersection which is an important part in many algorithms, e.g., RID-list intersection, inverted indexes, and others. In contrast to traditional scalar sorted-set intersection algorithms that try to reduce the number of comparisons, we propose a parallel algorithm that relies on speculative execution of comparisons. In general, our algorithm requires more ...
The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K-List Similarity Search
We consider the problem of processing similarity queries over a set of top-k rankings where the query ranking and the similarity threshold are provided at query time. Spearman’s Footrule distance is used to compute the similarity between rankings, considering how well rankings agree on the positions (ranks) of ranked items (i.e., the L1 distance). This setup allows the application of metric ind...
Scheduling Intersection Queries in Term Partitioned Inverted Files
This paper proposes and presents a comparison of scheduling algorithms applied to the context of load balancing the query traffic on distributed inverted files. We put emphasis on queries requiring intersection of posting lists, which is a very demanding case for the term partitioned inverted file and a case in which the document partitioned inverted file used by current search engines can perf...
Publication date: 2013